### **Cache Memories**

CSE 238/2038/2138: Systems Programming

### **Instructor:**

Fatma CORUT ERGİN

Slides adapted from Bryant & O'Hallaron's slides

# **Today**

- Cache memory organization and operation
- Performance impact of caches
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality



# **General Cache Concept**



### **Cache Memories**

- Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:



# General Cache Organization (S, E, B)





# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



If tag doesn't match: old line is evicted and replaced

### **Direct-Mapped Cache Simulation**

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| X   | XX  | X   |

M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 Blocks/set

Address trace (reads, one byte per read):

| 0 | $[0000_2],$                          | miss |
|---|--------------------------------------|------|
| 1 | [ <mark>000</mark> 1 <sub>2</sub> ], | hit  |
| 7 | $[0111_2],$                          | miss |
| 8 | [1000 <sub>2</sub> ],                | miss |
| 0 | [00002]                              | miss |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 0   | M[0-1] |
| Set 1 |   |     |        |
| Set 2 |   |     |        |
| Set 3 | 1 | 0   | M[6-7] |

# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set Assume: cache block size 8 bytes Address of short int: 0...01 t bits 100 1 2 3 4 5 6 tag find set 0 | 1 | 2 | 3 | 4 | 5 | 6 7 0 | 1 | 2 | 3 | tag tag 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | tag 3 4

# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set



# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set Assume: cache block size 8 bytes Address of short int: t bits 0...01 100 compare both valid? + match: yes = hit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 tag tag block offset

#### No match:

One line in set is selected for eviction and replacement

short int (2 Bytes) is here

Replacement policies: random, least recently used (LRU), ...

### 2-Way Set Associative Cache Simulation

| t=2 | s=1 | b=1 |
|-----|-----|-----|
| XX  | X   | X   |

M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/set

Address trace (reads, one byte per read):

| 0 | $[00\underline{0}0_{2}],$ | miss |
|---|---------------------------|------|
| 1 | $[0001_{2}],$             | hit  |
| 7 | $[01\underline{1}1_2],$   | miss |
| 8 | [1000 <sub>2</sub> ],     | miss |
| 0 | [0000]                    | hit  |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
|       |   |     |        |
| Set 1 | 1 | 01  | M[6-7] |
|       | 0 |     |        |

### **Example: Core i7 L1 Data Cache**

32 KB cache 8-way set associative 64 bytes/block 47 bit address range







#### Address of word:



Block offset: ??? bits Set index: ??? bits

Tag: ??? bits

Stack address:

0x00007f7262a1e010

Block offset: 0x??

Set index: 0x??

Tag: 0x??

### **Example: Core i7 L1 Data Cache**







| He | h De | cimal Binary |
|----|------|--------------|
| 0  | 0    | 0000         |
| 2  | 1    | 0001         |
| 2  | 2    | 0010         |
| 3  | 3    | 0011         |
| 4  | 4    | 0100         |
| 5  | 5    | 0101         |
| 6  | 6    | 0110         |
| 7  | 7    | 0111         |
| 8  | 8    | 1000         |
| 9  | 9    | 1001         |
| A  | 10   | 1010         |
| В  | 11   | 1011         |
| С  | 12   | 1100         |
| D  | 13   | 1101         |
| E  | 14   | 1110         |
| F  | 15   | 1111         |

#### Address of word:



Block offset: 6 bits Set index: 6 bits Tag: 35 bits (47-6-6)

### **Stack address:**

0x00007f7262a1e010 S 0000 0001 0000

Block offset: 0x10

Set index: 0x0

Tag: 0x00007f7262a1e

### What about writes?

### Multiple copies of data exist:

- L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Need a dirty bit (line different from memory or not)

#### What to do on a write-miss?

- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
- No-write-allocate (writes straight to memory, does not load into cache)

### Typical

- Write-through + No-write-allocate
- Write-back + Write-allocate

# **Intel Core i7 Cache Hierarchy**

#### Processor package



#### L1 i-cache and d-cache:

32 KB, 8-way, Access: 4 cycles

#### L2 unified cache:

256 KB, 8-way, Access: 10 cycles

#### L3 unified cache:

8 MB, 16-way, Access: 40-75 cycles

Block size: 64 bytes for

all caches.

### **Cache Performance Metrics**

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses)
   = 1 hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.</li>

#### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2

### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory (Trend: increasing!)

### **Writing Cache Friendly Code**

- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)

# **Today**

- Cache organization and operation
- Performance impact of caches
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### **Matrix Multiplication Example**

### Description:

- Multiply n x n matrices
- Matrix elements are doubles (8 bytes)
- O(n³) total operations
- n reads per source element
- n values summed per destination
  - but may be able to hold in register

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
       sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

matmult/mm.c</pre>
```

### Miss Rate Analysis for Matrix Multiply

#### Assume:

- Cache block size = 32B (big enough for four doubles)
- Matrix dimension (n) is very large
  - Approximate 1/n as 0.0
- Cache is not even big enough to hold multiple rows

### Analysis Method:

Look at access pattern of inner loop



# Layout of C Arrays in Memory (review)

- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:

```
for (i = 0; i < n; i++)
sum += a[0][i];</pre>
```

- accesses successive elements
- if block size (B) > sizeof(a<sub>ii</sub>) bytes, exploit spatial locality
  - miss rate = sizeof(a<sub>ii</sub>) / B
- Stepping through rows in one column:

```
for (i = 0; i < n; i++)
sum += a[i][0];</pre>
```

- accesses distant elements
- no spatial locality!
  - miss rate = 1 (i.e. 100%)

# Matrix Multiplication (ijk)

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
       sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

matmult/mm.c</pre>
```

```
Inner loop:

(*,j)

(i,*)

B

C

T

Row-wise Column-wise Fixed
```

### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.25     | 1.0      | 0.0      |

# Matrix Multiplication (jik)

```
/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum
  }
}

matmult/mm.c</pre>
```

### Inner loop:



### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.25     | 1.0      | 0.0      |

# **Matrix Multiplication (kij)**

```
/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
  }
}
    matmult/mm.c</pre>
```



### Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 0.0 0.25 0.25

# **Matrix Multiplication (ikj)**

```
/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
  for (j=0; j<n; j++)
    c[i][j] += r * b[k][j];
  }
}
matmult/mm.c</pre>
```



### Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 0.0 0.25 0.25

# **Matrix Multiplication (jki)**

```
/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
  }
}

matmult/mm.c</pre>
```



### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 1.0      | 0.0      | 1.0      |

### **Matrix Multiplication (kji)**

```
/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
  }
}
    matmult/mm.c</pre>
```



### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 1.0      | 0.0      | 1.0      |

### **Summary of Matrix Multiplication**

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
   sum = 0.0;
   for (k=0; k< n; k++)
     sum += a[i][k] * b[k][j];
   c[i][j] = sum;
for (k=0; k< n; k++) {
 for (i=0; i<n; i++) {
 r = a[i][k];
 for (j=0; j<n; j++)
   c[i][j] += r * b[k][j];
for (j=0; j<n; j++) {
 for (k=0; k< n; k++) {
   r = b[k][j];
   for (i=0; i<n; i++)
   c[i][j] += a[i][k] * r;
```

### ijk (& jik):

- 2 loads, 0 stores
- misses/iter = **1.25**

#### kij (& ikj):

- 2 loads, 1 store
- misses/iter = **0.5**

### jki (& kji):

- 2 loads, 1 store
- misses/iter = **2.0**

### **Core i7 Matrix Multiply Performance**



# **Today**

- Cache organization and operation
- Performance impact of caches
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### **Example: Matrix Multiplication**



# **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>

### First iteration:

- n/8 + n = 9n/8 misses

Afterwards in cache: (schematic)



n

# **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>

#### Second iteration:

Again: n/8 + n = 9n/8 misses



#### Total misses:

- 9n/8 \* n<sup>2</sup> = (9/8) \* n<sup>3</sup>

n

# **Blocked Matrix Multiplication**

```
c = (double *) calloc(sizeof(double), n*n);
/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i+=B)
       for (i = 0; i < n; i+=B)
             for (k = 0; k < n; k+=B)
                /* B x B mini matrix multiplications */
                  for (i1 = i; i1 < i+B; i++)
                      for (j1 = j; j1 < j+B; j++)
                          for (k1 = k; k1 < k+B; k++)
                              c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1];
                                                         matmult/bmm.c
```



### **Cache Miss Analysis**

#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>
- Three blocks fit into cache: 3B<sup>2</sup> < C</p>

### First (block) iteration:

- B<sup>2</sup>/8 misses for each block
- 2n/B \* B²/8 = nB/4 (omitting matrix c)

Afterwards in cache (schematic)



n/B blocks

# **Cache Miss Analysis**

#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>
- Three blocks fit into cache: 3B<sup>2</sup> < C</p>

### Second (block) iteration:

- Same as first iteration
- 2n/B \* B<sup>2</sup>/8 = nB/4



#### Total misses:

 $\blacksquare$  nB/4 \* (n/B)<sup>2</sup> = n<sup>3</sup>/(4B)

n/B blocks

# **Blocking Summary**

- No blocking: (9/8) \* n³
- Blocking: 1/(4B) \* n³
- Suggest largest possible block size B, but limit 3B<sup>2</sup> < C!
- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n<sup>2</sup>, computation 2n<sup>3</sup>
    - Every array elements used O(n) times!
  - But program has to be written properly

# **Cache Summary**

Cache memories can have significant performance impact

- You can write your programs to exploit this!
  - Focus on the inner loops, where bulk of computations and memory accesses occur.
  - Try to maximize spatial locality by reading data objects with sequentially with stride 1.
  - Try to maximize temporal locality by using a data object as often as possible once it's read from memory.